Abstract

The on-going goal of this project is to develop analytical tools and computational models for vision to be
used as a sensor for the purpose of control. Vision, as in remote passive distributed sensing, whether in the
visible or other spectra, is a flexible, powerful and cheap sensory modality for unmanned vehicles to interact
with complex, unknown, uncertain and dynamic environments. To that end, algorithms and models must be
developed to causally infer geometric (shape), photometric (reflectance) and dynamic (motion) properties of
objects and scenes. In this report we describe progress on all areas, including the following breakthroughs:

1. We have fully developed, after their introduction early in the project, Dynamic Active Appearance
Models [15, 16] to describe variations in shape (domain deformation), reflectance (contrast) and motion
via non-linear conditionally Gaussian processes that capture complex phenomena such as the motion
of faces, flames and foliage. We have further extended these models to take into account occlusion
phenomena [27], which are fundamental in vision.

2. We have developed filtering and identification techniques for a class of (Hammerstein) dynamical models
driven by non-Gaussian processes that are particularly well suited to model human motion in video
[3, 43, 6]; we have also developed computational and modeling tools to enforce priors on dynamically
moving shapes [11, 26] to enable tracking through occlusions or with partial information.

3. We have developed forward diffusion models and numerical schemes for integrating the ensuing partial
differential equations for shape estimation from diffusion images (defocus or motion blur) [20].

4. We have proven the observability and identifiability of ego-motion in the presence of visual and inertial
measurements [29], and characterized the set of sufficiently exciting inputs. In addition, we have proved
that the local minima in the landscape of visual motion estimation under forward motion can be removed by
positivity constraints [47].

5.  We  have  developed  a  framework  for  visual  tracking  of  severely  deforming  objects,  by  proposing  the 
first  ever  (infinite-dimensional)  observer  capable  of  predicting  general  (diffeomorphic)  deformations, 
rather  than  just  a  finite-dimensional  group  (unpublished). 

6. We have been able to characterize visual invariants to general viewpoint and contrast changes
(unpublished).

In addition, we have formulated a series of conjectures and developed a research program for the classification
of time series [42, 40], together with the first steps to build dynamic action dictionaries [41, 54]. Other
contributions include algorithms for vision-based and lidar-based localization [21], proximity distribution
kernels for visual category recognition [32], and wide-sense filtering on Lie groups [10]. Some of this work
has been conducted in collaboration with Stan Osher at UCLA [11, 20], Rene Vidal at Johns Hopkins [56],
and Anthony Yezzi at Georgia Tech.


Outcomes  at-a-glance 

This project has resulted in a number of technical achievements, and some breakthroughs, documented in 17
publications (see footnote 1) in the most prestigious conferences and journals in the field of Computer Vision,
including a textbook [19].

Some of the students and postdocs involved in the project have found placement in prestigious
industrial and academic institutions in the US and abroad, including Dr. Andrea Vedaldi, awarded the
only Outstanding PhD Award from the UCLA Computer Science Department in 2008 and now a Postdoc at
Oxford University after turning down a Faculty Position at the Ecole Centrale in Paris; Byung-Woo Hong,
now Assistant Professor at Chung-Ang University in Seoul, Korea; Haibin Ling, now Assistant Professor
at Temple University; Gregorio Guidi, now an Analyst at the Central Bank of Italy; Daniele Fontanelli,
a researcher at the University of Pisa; Alessandro Bissacco, now Research Engineer at Google Inc.;
and Emmanuel Prados, Researcher at INRIA, Grenoble, France.

Our work has also attracted the attention of several companies that have supported or are supporting
corollary activities, including Mitsubishi Heavy Industries, Toshiba, Sony and Panasonic of Japan, who
supported a staff researcher to visit the Vision Laboratory for a year, which resulted in state-of-the-art
algorithms for human detection in video (Toshiba) and stereo reconstruction (Panasonic), and Mobileye, Inc.
(Israel), which collaborated with the Vision Lab during the 2005 DARPA Grand Challenge prior to the
commencement of this project.


Technical  Achievements 

Dynamic  Active  Appearance  Models 

Images are generated by a complex interplay of photometric (illumination, reflectance), geometric (shape,
pose) and dynamic (motion, deformation) properties of the scene. These factors are intermingled, and
not uniquely identifiable from an image (or even a sequence of images), so that there are infinitely many
combinations of photometric, geometric and dynamic processes that generate the same image sequences.
When the goal is not to control each factor individually, but instead to perform classification, for instance
the detection, recognition and localization of objects or events of interest, one seeks the simplest possible
models that capture the phenomenology of the data. In the past, we have developed models of so-called
dynamic textures [14, 13] that capture temporally and spatially stationary sequences. This model has been
extended further to represent spatially non-stationary, photometrically and temporally stationary sequences,
which we have called Dynamic Active Appearance Models (DAAM) [16, 15].

DAAM are considerably more powerful than dynamic textures: they can be used to represent the
same data with lower complexity, or to capture more complex phenomena such as a flag waving in the
wind or a talking face with greater fidelity at equal complexity. The price we pay for that benefit is
an increased complexity of the inference process (filtering and identification). In fact, whereas Dynamic
Textures were essentially linear-Gaussian dynamical models, DAAM are intrinsically non-linear. However,
the models are conditionally Gaussian in the sense that shape deformation can be characterized as a Gaussian
shape space (the quotient of R^2 modulo the affine group); conditioned on shape, reflectance is represented
as an affine shape space (R^+ modulo monotonic continuous transformations); conditioned on reflectance
and shape, the temporal evolution of the (finite-dimensional) representation is modeled as the output of
a linear-Gaussian dynamical model. So, although non-linear and non-Gaussian, the model is conditionally
Gaussian. Unfortunately, filtering and identification of these models can no longer be performed simply with
standard subspace identification techniques [56]; instead, finite-element methods and identification of
hybrid systems must be brought to bear.

Footnote 1: Listed in the reference section as [10, 49, 43, 42, 16, 19, 35, 23, 25, 27, 20, 28, 3, 46, 15, 34,
11, 17, 18, 58, 38, 41, 22, 54, 48, 61, 36, 24, 21, 40, 29, 62, 32, 26, 6, 8, 33, 47, 56, 52, 7, 60, 4, 5, 12,
51, 39, 53].
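
For orientation, the linear-Gaussian building block referred to above, x_{t+1} = A x_t + v_t, y_t = C x_t + w_t,
can be fitted in closed form. The following is a minimal sketch of a generic SVD-based (subspace-style) fit,
not the DAAM inference procedure; the function names fit_lds and synthesize, and the choice of state
dimension, are assumptions made for illustration only.

```python
import numpy as np

def fit_lds(Y, n_states):
    """Fit y_t = C x_t + mean, x_{t+1} = A x_t from data Y (pixels x frames).

    Closed-form, suboptimal fit: C from the SVD of the centered data,
    A by least squares on consecutive states.
    """
    y_mean = Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    C = U[:, :n_states]                          # output (appearance) basis
    X = np.diag(s[:n_states]) @ Vt[:n_states]    # state trajectory, n_states x frames
    # Least-squares estimate of A such that X[:, 1:] ~ A X[:, :-1]
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C, X, y_mean

def synthesize(A, C, x0, y_mean, n_frames):
    """Roll the noise-free model forward from the initial state x0."""
    frames, x = [], x0
    for _ in range(n_frames):
        frames.append(C @ x + y_mean[:, 0])
        x = A @ x
    return np.stack(frames, axis=1)
```

In this sketch the columns of Y are vectorized frames; once shape and reflectance are made state-dependent, as
in DAAM, a closed-form fit of this kind no longer applies.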

As we have done for the case of Dynamic Textures in the past, after developing the model and testing
its generative power by measuring how faithfully it matches the second- and higher-order statistics of the
data, we intend to exploit these models for the purpose of decision, specifically to detect, localize, classify
and recognize events of interest in video. Whereas data fidelity and complexity are all that matters for
communication purposes (transmission, compression), recognition requires handling nuisance transformations
and endowing the models with a metric that enables efficient computation of distances and the proper
definition of prior probability measures. This direction of investigation will be pursued in future work.

Dynamical Models of Human Motion

Human motion presents a significant challenge due to the importance of the application domain (security,
persistent ISR), the variability with which humans can appear in images (different clothing, different pose,
different illumination, partial occlusions) and move (different gaits), and their complex dynamics (humans are
essentially multiple inverted pendula). Our goal is to develop techniques to detect, localize and recognize
actions regardless of the individual, and to recognize the individual regardless of the action, clothing, pose,
illumination etc. The first step in this program is to extract a time series from the measurements. This
can be done in a number of ways, from trivial (consider the video itself as a time series of pixel intensity
values) to complex (estimate silhouettes of moving humans, seen from a moving platform [30], including the
enforcement of prior knowledge on their shape and deformation [11, 25, 17]). Once that is done, we cannot
simply compare two time series as if they were two functions of time, using any number of functional norms,
because of the large variability due to nuisance factors (pose, initial condition, speed of execution etc.). Long
ago, in [2], we were the first to propose interpreting such time series as the output of dynamical models,
and to perform decisions, such as the recognition of an individual from her gait, by comparing statistics
of the models identified from the data. That has worked very well for stationary sequences, for instance
segments of walking, running, jumping, limping etc. In the past we have also extended these models to
piecewise stationary statistics, which led us to develop techniques to perform filtering and identification of
hybrid (jump-linear) models [57, 55, 59]. However, for more complex and transient motions, it was not clear
at the beginning of this project how one could factor out nuisance factors such as the speed of execution of
an action, or the initial condition, in a computationally efficient manner. During the course of this project,
we have developed (Mercer) kernels between dynamical models that enable one to compare them while
discounting the effects of various factors, depending on the application, including initial time, initial
condition, distribution of the input sequences, speed of execution etc. [3]. This work required considerable care
in order to properly treat the non-Gaussian statistics of the input, which are filtered through non-minimum
phase models.
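
To give a flavor of what a kernel between dynamical models looks like, the sketch below computes a discounted
inner product between the noise-free outputs of two linear models. It is a Mercer kernel because it is an inner
product of (weighted) output sequences, but it is only an illustration under simplifying assumptions: it is not
the kernel of [3], it does not discount initial conditions or speed of execution, and the names trace_kernel and
normalized_kernel, the discount factor and the truncation length are all hypothetical.

```python
import numpy as np

def trace_kernel(model_a, model_b, discount=0.9, n_terms=200):
    """Discounted inner product between the noise-free outputs of two models.

    Each model is a tuple (A, C, x0) with outputs of equal dimension. The value is
    sum_t discount**t * <C_a A_a^t x0_a, C_b A_b^t x0_b>, truncated at n_terms.
    """
    A_a, C_a, x_a = model_a
    A_b, C_b, x_b = model_b
    total, weight = 0.0, 1.0
    for _ in range(n_terms):
        total += weight * float((C_a @ x_a) @ (C_b @ x_b))
        x_a, x_b = A_a @ x_a, A_b @ x_b   # advance both models one step
        weight *= discount
    return total

def normalized_kernel(model_a, model_b, **kw):
    """Cosine-style normalization so that k(m, m) = 1."""
    k_ab = trace_kernel(model_a, model_b, **kw)
    k_aa = trace_kernel(model_a, model_a, **kw)
    k_bb = trace_kernel(model_b, model_b, **kw)
    return k_ab / np.sqrt(k_aa * k_bb)
```

Two models with similar dynamics started from the same state score close to 1, while models whose outputs
quickly decorrelate score close to 0.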

In addition, we have also developed techniques for imposing prior knowledge on the evolving shape of
objects [11], and most recently we have developed (infinite-dimensional) models to identify the state of a
model tracking a deforming shape. This enables the prediction not just of the motion of an object, but also
of its deformation, which enables tracking objects through severe occlusions [26], whereas previous schemes
could predict pose, but not shape, after the target became unoccluded.

Shape  From  Anisotropic  Diffusion 

During  the  course  of  this  project  we  have  completed  a  comprehensive  research  program,  initiated  in 
previous  years,  on  the  exploitation  of  diffusion  cues,  from  motion  blur  to  defocus  to  confocal  imaging,  for  the 
estimation  of  shape  and  reflectance  properties  of  scenes.  In  later  investigations  we  have  also  extended  this 
to  infer  independent  motions  in  a  scene.  This  work  has  been  collated  in  a  textbook  published  by  Springer 
Verlag  in  December  2006  [19]. 

In particular, one approach used for shape and motion optimization has proven especially successful, as
we describe next. In the presence of unknown motion or unknown optical configuration, the imaging process
is essentially a (blind) convolution process. Hence, most prior work on shape and motion inference focused
on blind deconvolution, a notoriously ill-posed, numerically ill-conditioned process. In [20], in collaboration
with former student P. Favaro (now at Heriot-Watt and the University of Edinburgh) and colleague S. Osher,
we have proposed a revolutionary approach based on a forward diffusion with a space-varying stopping time.
This has vastly exceeded the state of the art in both accuracy and computational complexity in the estimation
of shape from defocus as well as motion blur.
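
The idea of a forward diffusion whose amount varies over the image can be illustrated in one dimension.
The following is a minimal sketch, assuming that the space-varying amount of diffusion is encoded as a
per-sample diffusivity c(x) integrated for a common time, which is one simple way to emulate a stopping time
that varies with position; it is not the numerical scheme of [20], and the function name diffuse and the step
size are illustrative assumptions.

```python
import numpy as np

def diffuse(signal, diffusivity, n_steps, dt=0.2):
    """Explicit scheme for u_t = (c(x) u_x)_x with zero-flux boundaries.

    `diffusivity` holds c(x) >= 0 per sample; a larger c means more blur
    at that location after the same number of steps.
    """
    u = np.asarray(signal, dtype=float).copy()
    c = np.asarray(diffusivity, dtype=float)
    c_half = 0.5 * (c[:-1] + c[1:])          # diffusivity at cell interfaces
    assert dt * c_half.max() <= 0.5, "explicit scheme would be unstable"
    for _ in range(n_steps):
        flux = c_half * (u[1:] - u[:-1])     # flux across each interface
        u[:-1] += dt * flux                  # gain of the left cell
        u[1:] -= dt * flux                   # matching loss of the right cell
    return u
```

Because the update is written in flux form, the total intensity is conserved, and regions where c(x) = 0 stay
sharp while the rest of the signal is blurred. Only forward (stable) diffusion is ever simulated, which is what
avoids the ill-conditioning of deconvolution.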

Visual-Inertial  Integration 

It is generally known that vision and inertial sensing are complementary modalities: the former is slow and
global (modulo visibility), but yields measurements only up to a Euclidean reference and an unknown scale; the
latter is fast and local, but subject to drift. Several approaches have been proposed to integrate the two, but all
of them, without exception that we know of, fail to address the two issues that are of most critical importance
in practical applications, that is, (a) the calibration between the camera reference frame and the inertial
frame (usually assumed known through some delicate metrology), and (b) the handling of gravity (usually
assumed known through the usual cohort of ad-hoc fixes common in inertial navigation practice). This
poses an obstacle to the deployment of vision-inertial measurements when (a) no accurate calibration is
provided, and (b) small errors in gravity (usually estimated by averaging acceleration over a long time
frame) cause long-range divergence of the ensuing filter.
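
To make point (b) concrete: an error in the assumed gravity vector acts as a constant acceleration bias, and if
left uncorrected it produces a position error that grows quadratically with time. The sketch below double-
integrates such a bias; the function name position_drift and the numerical values are illustrative assumptions,
not figures from our experiments.

```python
def position_drift(gravity_error, duration, dt=0.01):
    """Double-integrate a constant acceleration bias (m/s^2) over `duration` seconds.

    Returns the accumulated position error in meters.
    """
    velocity, position = 0.0, 0.0
    for _ in range(round(duration / dt)):
        velocity += gravity_error * dt   # bias integrates into a velocity error
        position += velocity * dt        # which integrates into a position error
    return position
```

For example, a gravity error of 0.01 m/s^2 (about 0.1% of g) yields roughly 0.5 * 0.01 * 60^2 = 18 m of
position error after one minute of unaided integration.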

In [29] we have thoroughly analyzed the observability and identifiability of a full Euclidean frame from joint
vision and inertial measurements. We have been able to show that (a) the camera-IMU calibration is identifiable,
so it does not need to be accurately known ahead of time and can be refined on-line, and (b) gravity is
observable from joint vision and inertial measurements, so it can be updated on-line while the overall filter
is guaranteed to be observable (unlike the case of pure inertial navigation). The observability/identifiability
conditions require that the input sequence be sufficiently exciting. We have characterized the sufficiently
exciting inputs in terms of a motion sequence, corresponding to an “autocalibrating motion” that can be
performed at the beginning of an experiment, without the need for additional instrumentation.

Since the motion usually undergone during steady-state guidance (constant velocity) does not correspond
to a sufficiently exciting input, in general the gravity and camera-IMU calibration states can drift during
steady-state motion. While in practice this problem can be fixed with suitable
gain scheduling techniques, there remains the need to analyze the overall identifiability/observability of a
dynamical model whose (non-linear) state-space is partitioned into regions that are observable, and regions
where the parameters of the model are not identifiable. A question we intend to address in the future is
whether transients following transitions across such region boundaries are sufficient to absorb the drift (and
therefore no specific action is needed in the design of observers), or whether there is an optimal strategy
to lock the unobservable states and the unidentifiable parameters when the system goes through regions of
state-space, and of inputs, that correspond to non-identifiable, non-observable conditions.

Tracking  Deforming  Shapes  (unpublished) 

Tracking deformable objects in video is an important problem in numerous applications; we have already
described the specific case of humans, but a variety of other objects are of interest. Note that even rigid
objects can yield deformable domains when imaged through a projection, and the more complex the
three-dimensional shape, the more complex the deformation of the projection. The usual approach to tracking such
deformable objects consists in either treating each frame as an independent entity and segmenting the object
from the background based on pictorial (reflectance) cues, or in tracking a finite-dimensional representation
of the object corresponding to its coarse motion (e.g. affine). The former approach has nothing “dynamic”
to it, and therefore has no predictive power whatsoever (in the presence of missing data, the estimate
remains locked to the last available measurement). The latter can extrapolate coarse motion (e.g. centroid
and second-moment matrix), but cannot extrapolate deformation. For instance, when a walking person is
partially occluded, the approach extrapolates a moving cardboard figure, but not the actual deformation
due to individual limb motion [11].

We have proposed what is, to the best of our knowledge, the first approach to design a proper filter
(observer) in the infinite-dimensional space of shapes (closed Jordan curves). This is based on endowing the
space with a Riemannian (Sobolev) metric, then shooting geodesics from the current best estimate of the
state using the exponential map in the infinite-dimensional Lie group of diffeomorphisms, and finally correcting
the prediction when a new measurement of a curve becomes available. This work has yet to be published,
although a technical report has recently been deposited.

The next step in this program consists in making the measurement equation one step closer to the actual
data, which are images. So, instead of assuming that we have an intermediate representation that yields
a pseudo-measurement (a curve), we intend to measure images directly, thus extending the framework of
DEFORMOTION [44] to a proper dynamical system observer framework.

Visual  Invariants  (unpublished) 

It was widely believed that viewpoint-invariant image statistics did not exist except for planar objects,
due to [9]. In [50], we showed that viewpoint-invariant statistics always exist, for scenes of arbitrary shape,
provided reflection is Lambertian. At the same time, a significant avenue of research deals with contrast-
invariant image processing, pioneered by Koenderink [31] and Morel [1]. Unfortunately, the statistics that
are invariant to viewpoint are not invariant to contrast, and vice-versa. Therefore, we have recently turned
our attention to the problem of either identifying viewpoint-and-contrast invariant statistics, or disproving
their existence.

We have recently shown that such viewpoint-and-contrast invariants exist, and that they are supported on
a zero-measure subset of the image domain. They are related to topological constructions derived from
the Morse-Smale complex, which we have called sub-Reeb trees. While this is already interesting, because
it illustrates why it is possible to compress images so efficiently without much perceptual effect, what is
remarkable is the fact that this zero-measure object is actually a sufficient statistic, meaning that from it
one can recover exactly the original image, modulo the action of a domain diffeomorphism (viewpoint change)
and contrast transformation (illumination change). This result, which we believe to be of great theoretical
significance, has not yet been published, but the main argument and the proofs have been deposited in a
technical report [45], coauthored with colleague P. Petersen.
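
The contrast half of this invariance can be seen in a one-dimensional toy example: the location and type of
the critical points of a signal do not change under a strictly increasing transformation of its values. The
sketch below only illustrates this elementary fact; it does not construct sub-Reeb trees or address viewpoint
changes, and the function name local_extrema and the sample values are assumptions for illustration.

```python
import math

def local_extrema(signal):
    """Return (index, kind) for strict interior local maxima and minima."""
    found = []
    for i in range(1, len(signal) - 1):
        left, mid, right = signal[i - 1], signal[i], signal[i + 1]
        if mid > left and mid > right:
            found.append((i, "max"))
        elif mid < left and mid < right:
            found.append((i, "min"))
    return found

signal = [0.2, 0.9, 0.4, 0.1, 0.6, 0.8, 0.3]
# A strictly increasing map stands in for a contrast change.
recontrasted = [math.exp(3.0 * v) for v in signal]
```

Calling local_extrema on signal and on recontrasted returns the same list, since a monotonic map preserves
every ordering comparison between neighboring samples.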

These results pertain to Lambertian scenes and assume that the image is approximated by a Morse
function. While this is a fair assumption for many subsets of the image, in the sense that it is possible
to find image statistics that are to good approximation piecewise smooth [37], there remains the need to
first partition the image into domains where the Morse assumption is satisfied to a reasonable extent. We
have recently begun experimenting with various approaches for texture segmentation that would provide a
preprocessing step for the construction of a collection of sub-Reeb trees. We are studying the theoretical
properties that such a “sparse invariant coding” would have in the overall problem of performing decisions
from image data, and at the same time we are considering the numerical and computational implications that
these representations would have, in the sense of enabling the design of more efficient image-based and
video-based classification and recognition algorithms.